1. INTRODUCTION

In recent years, the volume of data and the number of features describing them have grown exponentially. Reducing the size of the data by removing irrelevant or redundant variables and keeping only the most significant ones according to some criterion has therefore become a requirement before any classification, and this reduction should give the best performance according to some objective function [1]-[5]. DNA microarray technology makes it possible to study thousands of genes simultaneously in a single experiment, and provides a large amount of data from which much knowledge can be extracted. A set of microarray gene expression data can be represented in tabular form, in which each row represents a particular gene, each column a sample, and each entry of the matrix is the measured expression level of a gene in a sample. Researchers have a database of more than 40,000 gene sequences that they can use for this purpose. Unfortunately, the enormous size of DNA microarray data causes problems when it is processed by clustering or classification algorithms such as SOM, K-means or KNN, so pre-processing the data beforehand to reduce its size becomes a necessity. Feature selection consists of choosing a subset of input variables by deleting redundant or irrelevant features from the original dataset. Consequently, the execution time for classifying the data decreases and the accuracy increases [6].

Feature selection algorithms are divided into three categories: filters, wrappers, and embedded or hybrid selectors [7], [8]. Filters extract features from the data without any learning involved, by ranking all features and choosing the top ones [9]-[11]. Several filters are widely used in the literature. Information Gain (IG) [12] ranks features by a relevancy score computed for each individual attribute. Correlation-based Feature Selection (CFS) looks for features that are highly correlated with the class but have little or no correlation with each other (Hall, 2000). Minimum Redundancy Maximum Relevance (mRMR) [8] maximizes the relevancy of genes with the class label and minimizes the redundancy in each class using Mutual Information (MI) measures. ReliefF is also widely used with cancer microarray data [13]; it detects features that are statistically relevant to the target concept.

Wrappers use a classification algorithm to evaluate which features are useful; that is, the features are selected taking the classification algorithm into account [14]. Many studies have applied wrapper selectors. Gheyas and Smith proposed a method named simulated annealing genetic algorithm (SAGA), which incorporates existing wrapper methods into a single solution [15]. Huerta et al. proposed the LDA-based Genetic Algorithm (LDA-GA) [16], which applies a t-statistic filter to retain a group of p top-ranking genes and then uses the LDA-based GA. Tang et al. proposed the leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, which combines the leave-one-out calculation measure with the sequential forward selection scheme [17]. Perez and Marwala developed the Genetic Algorithm-Support Vector Machine (GA-SVM), which creates a population of chromosomes as binary strings representing feature subsets that are evaluated using SVMs [18].

The third category of feature selection approaches is embedded (hybrid) methods, which take advantage of the two previous models by using their different evaluation criteria in different search stages [19]. One of the most widely applied embedded techniques is the support vector machine based on Recursive Feature Elimination (SVM-RFE) for gene selection and cancer classification, proposed by Guyon et al. [20]. Maldonado et al. proposed an embedded approach called kernel-penalized SVM (KP-SVM) by introducing a penalty factor in the dual formulation of SVM [21]. Mundra et al. hybridized two of the most popular feature selection approaches, SVM-RFE and mRMR [22]. Chuang et al. proposed a hybrid approach that combines correlation-based feature selection (CFS) and the Taguchi-Genetic Algorithm (TGA), using KNN as the classifier with leave-one-out cross-validation (LOOCV) [23]. Lee and Liu [24] proposed an approach called Genetic Algorithm Dynamic Parameter (GADP), which produces every possible subset of genes and ranks the genes by their occurrence frequency.

This paper therefore presents a review of widely used feature selection techniques, focusing on cancer classification. Other tasks related to microarray data analysis, such as handling missing values, normalization and discretization, are also covered, and commonly used classification methods are discussed. The study evaluates five filter algorithms (random forest selector, information gain, chi-squared, linear correlation and mRMR) on three cancer datasets, and their effect on four classification algorithms: SOM, KNN, K-means and Random Forest.


2. METHOD AND MATERIALS 
2.1. General Background

Analysis of gene expression data is primarily based on the comparison of gene expression profiles. To do this, we need a measure that quantifies the similarity between the expression profiles of genes. A variety of distance measures can be used to compute similarity; this section describes the most commonly used ones. The gene expression data from microarray experiments usually take the form of a large matrix $G_{(n+1) \times m}$ of expression levels of genes $g_1, g_2, \dots, g_n$ under different experimental conditions $s_1, s_2, \dots, s_m$; the last row contains the label $Y$ of each sample, with values $y_j \in \{-1, 1\}$. Each element $G[i, j]$, denoted $g_{ij}$, represents the expression level of gene $g_i$ in sample $s_j$ (see Table 1). The expression profile of gene $i$ can be represented as a row vector $g_i = (g_{i1}, g_{i2}, \dots, g_{im})$, so that:

$$G = \begin{pmatrix} g_1 \\ \vdots \\ g_n \\ Y \end{pmatrix} = \begin{pmatrix} g_{11} & \cdots & g_{1m} \\ \vdots & \ddots & \vdots \\ g_{n1} & \cdots & g_{nm} \\ y_1 & \cdots & y_m \end{pmatrix}$$


Table 1. Microarray Dataset Example

Genes\Samples   s1       s2         s3       s4         ...   sm
g1              56.23    43.74      4.18     9.5        ...   34.18
g2              33.54    30.5       4.71     32.18      ...   43.71
g3              13       29.09      4.13     2.88       ...   49.13
g4              64.25    70.24      76.1     31.4       ...   36.91
gn              3.54     0.5        40.71    2.99
Label: Y        Normal   Abnormal   Normal   Abnormal   ...   Normal

Pearson correlation coefficient: represented by the letter $\rho$, it is obtained from the covariance (cov) and the variances ($\sigma^2$) estimated on a sample. For two genes $g_1$ and $g_2$, the formula for $\rho$ is:

$$\rho_{g_1, g_2} = \frac{\sum_{i=1}^{m} (g_{1i} - \bar{g}_1)(g_{2i} - \bar{g}_2)}{\sqrt{\sum_{i=1}^{m} (g_{1i} - \bar{g}_1)^2} \, \sqrt{\sum_{i=1}^{m} (g_{2i} - \bar{g}_2)^2}} \qquad (2)$$

where $\bar{g}_1 = \frac{1}{m} \sum_{i=1}^{m} g_{1i}$ and $\bar{g}_2 = \frac{1}{m} \sum_{i=1}^{m} g_{2i}$ are the means of genes $g_1$ and $g_2$ respectively.
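As an illustration, equation 2 can be sketched in a few lines of Python. The function name `pearson` and the example profiles are hypothetical and are not taken from the datasets used in this study; the sketch assumes two profiles of equal length with non-zero variance.

```python
import math


def pearson(g1, g2):
    """Pearson correlation (equation 2) between two expression profiles."""
    if len(g1) != len(g2) or len(g1) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    m = len(g1)
    mean1 = sum(g1) / m
    mean2 = sum(g2) / m
    # Numerator: sum of products of the deviations from each mean.
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    # Denominator: product of the square roots of the summed squared deviations.
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var1 * var2)


# Hypothetical profiles, invented for illustration only.
rho = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Because the second profile is an exact multiple of the first, `rho` is 1 up to floating-point rounding.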

Mutual information (MI): a measure that compares genes whose profiles are discrete. It can be calculated using Shannon's entropy, and is used to measure the dependency between two random variables based on their probability distributions. For two genes $g_1$ and $g_2$, the mutual information between them, $I(g_1, g_2)$, can be calculated as follows [25], [26]:

$$\begin{aligned} I(g_1, g_2) &= H(g_1) - H(g_1 \mid g_2) \\ &= H(g_2) - H(g_2 \mid g_1) \\ &= H(g_1) + H(g_2) - H(g_1, g_2) \end{aligned} \qquad (3)$$


where $H(g_1)$ and $H(g_2)$ are the Shannon entropies, expressed as follows:

$$H(g_1) = -\sum_{i} P(g_{1i}) \log_2 P(g_{1i})$$

$H(g_1, g_2)$ is the joint entropy of $g_1$ and $g_2$, defined as follows:

$$H(g_1, g_2) = -\sum_{i} \sum_{j} P(g_{1i}, g_{2j}) \log_2 P(g_{1i}, g_{2j})$$

$H(g_2 \mid g_1)$ is the conditional entropy of $g_2$ given $g_1$. It can be calculated as follows:

$$H(g_2 \mid g_1) = -\sum_{i} \sum_{j} P(g_{1i}, g_{2j}) \log_2 P(g_{2j} \mid g_{1i})$$


Note that $P(g_{1i})$ is the probability mass function; when gene $g_1$ is discrete, it can be calculated as follows:

$$P(g_{1i}) = \frac{\text{number of instances with value } g_{1i}}{\text{total number of instances } (n)}$$

and $P(g_{1i}, g_{2j})$ is the joint probability mass function of genes $g_1$ and $g_2$.
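The quantities above can be estimated from discretized profiles by simple counting. The following is a minimal sketch, assuming both profiles are already discrete and of equal length; the function names and the toy profiles are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by relative frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def joint_entropy(x, y):
    """Joint entropy H(x, y), computed on the pairs (x_i, y_i)."""
    return entropy(list(zip(x, y)))


def conditional_entropy(y, x):
    """Conditional entropy H(y | x) = H(x, y) - H(x)."""
    return joint_entropy(x, y) - entropy(x)


def mutual_information(x, y):
    """Mutual information, third form of equation 3."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)


# Toy discretized profiles, invented for illustration only.
g1 = [0, 0, 1, 1]
g2 = [0, 0, 1, 1]   # identical to g1: fully dependent
g3 = [0, 1, 0, 1]   # independent of g1

mi_dependent = mutual_information(g1, g2)
mi_independent = mutual_information(g1, g3)
```

In this toy case `mi_dependent` equals the entropy of `g1` (1 bit) and `mi_independent` is 0, which matches the interpretation of MI as a dependency measure.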


2.2. Feature Selection 

The goal of feature selection is to select the smallest useful subset of features, by scoring all features and removing those whose score falls below a threshold. This makes the classification problem simpler to interpret and reduces the time needed to train the model. Mathematically, for a feature set composed of all genes $G_f = \{g_{f1}, g_{f2}, \dots, g_{fn}\}$, the feature selection process identifies a subset of features $S_f$ of dimension $k$, where $k < n$ and $S_f \subseteq G_f$. In this study, five feature selection algorithms are considered: random forest selector, information gain, mRMR, linear correlation and chi-squared. Filter methods were chosen over wrappers because of the high computational cost of wrapper methods [2].

Information gain (IG): a filter method that ranks features in decreasing order of their information gain, that is, the value of their mutual information with the class label computed using equation 3. Simplicity and low computational cost are the main advantages of this method. However, it does not take the dependency between features into consideration; rather, it assumes independence, which is not always the case. Some of the selected features may therefore carry redundant information.

Chi-squared ($\chi^2$): a statistical test used to determine the dependency of two events, which is simple to implement (in feature selection, the two events are the occurrence of the feature and the occurrence of the class). The process consists of calculating $\chi^2$ between every feature variable $g_{fi}$ and the label $Y$. If $Y$ is independent of $g_{fi}$, this feature variable is discarded; if they are dependent, it is kept for training the model [27]. The null hypothesis $H_0$ is the assumption that the two variables are uncorrelated, and it is tested with the $\chi^2$ formula as follows:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (4)$$

where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency under the null hypothesis, summed over the $r$ rows and $c$ columns of the contingency table. $E_{ij}$ can be computed by:

$$E_{ij} = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$


A high value of $\chi^2$ indicates that the hypothesis of independence is incorrect and that the feature is correlated with the class; it should therefore be selected for model training.

Linear correlation (Corr): a well-known similarity measure between two random variables. It can be calculated using the Pearson correlation coefficient ($\rho$) as defined in equation 2. The resulting value lies in $[-1, 1]$, with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation, and 0 meaning no linear correlation between the two variables [28].
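The chi-squared score of equation 4 can be sketched as follows, building the contingency table between a discretized feature and the class label by counting. This is only an illustrative sketch: the function name `chi_squared` and the example values are hypothetical, and the feature is assumed to be discrete already.

```python
from collections import Counter


def chi_squared(feature, labels):
    """Chi-squared statistic (equation 4) between a discrete feature and the class label."""
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have the same length")
    n = len(feature)
    observed = Counter(zip(feature, labels))  # observed frequency of each (value, class) cell
    row_totals = Counter(feature)             # totals per feature value
    col_totals = Counter(labels)              # totals per class
    chi2 = 0.0
    for value, row_total in row_totals.items():
        for label, col_total in col_totals.items():
            # Expected frequency under independence: row total x column total / sample size.
            expected = row_total * col_total / n
            chi2 += (observed[(value, label)] - expected) ** 2 / expected
    return chi2


# Invented example: the first feature follows the class exactly, the second is unrelated.
labels = [1, 1, -1, -1]
score_dependent = chi_squared(["high", "high", "low", "low"], labels)
score_independent = chi_squared(["high", "low", "high", "low"], labels)
```

Here the feature that follows the class receives a score of 4, while the unrelated feature receives 0, so a ranking by $\chi^2$ would keep the first and discard the second.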

Minimum Redundancy-Maximum Relevancy (mRMR): the mRMR filter method selects genes that have the highest relevance to the target class and are minimally redundant with each other [8], [29]. Both criteria are based on mutual information, computed using equation 3. The Maximum Relevance criterion selects the top $k$ genes that have the highest relevance to the class labels, taken from the set of $I(g_i, Y)$ values arranged in descending order (equation 5). The Minimum Redundancy criterion was introduced by [14] in order to remove redundant features; it is defined by equation 6.


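$$\max \left( \frac{1}{|S_f|} \sum_{g_i \in S_f} I(g_i, Y) \right) \qquad (5)$$

$$\min \left( \frac{1}{|S_f|^2} \sum_{g_i, g_j \in S_f} I(g_i, g_j) \right) \qquad (6)$$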


The mRMR filter takes the mutual information between each pair of genes into consideration and combines both optimization criteria of equations 5 and 6.
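One common way of combining the two criteria is a greedy search that, at each step, adds the gene maximizing relevance minus mean redundancy with the genes already selected. The sketch below assumes this difference form, which is not necessarily the variant used in this study; the function names, the gene names and the toy data (with four hypothetical classes chosen to keep the numbers simple) are all illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(x, y) = H(x) + H(y) - H(x, y), as in equation 3."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mrmr_select(genes, labels, k):
    """Greedy mRMR (difference form, an assumption): genes maps a name to a discrete profile."""
    relevance = {name: mutual_information(p, labels) for name, p in genes.items()}
    selected = []
    remaining = sorted(genes)  # sorted so that ties are broken deterministically
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            # Mean mutual information with the genes already selected.
            redundancy = sum(
                mutual_information(genes[name], genes[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: g_b duplicates g_a, while g_c carries complementary information.
genes = {"g_a": [0, 0, 1, 1], "g_b": [0, 0, 1, 1], "g_c": [0, 1, 0, 1]}
labels = [0, 1, 2, 3]
chosen = mrmr_select(genes, labels, 2)
```

All three toy genes are equally relevant, so a pure relevance ranking could return the duplicated pair `g_a`, `g_b`; the redundancy term makes the greedy search pick `g_a` and then `g_c` instead.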


2.3. Classifiers 
This part gives a brief description of the classifier algorithms commonly used for the classification task. Table 2 shows the parameters used for each classifier.


Table 2. Parameters of Classifiers

Classifier      Parameters
K-means         K = 2:9
                Distance = Euclidean distance
KNN             Distance = Euclidean distance
                Number of nearest neighbors = 5
                Kernel = rectangular
SOM             Number of input neurons = 10x10
                Learning rate = 0.9
                Radius = 20
                Distance metric = Euclidean
                Initialization = Random
                Number of iterations = 1000
Random Forest   Number of trees = 500
                Number of variables tried at each split = 10

K-means: a clustering (unsupervised classification) algorithm that divides observations into k clusters [30]-[32]. It can be adapted to the supervised classification case by dividing the data into a number of clusters equal to or greater than the number of classes. It takes a set $S$ of $m$ samples and the number of clusters $K$ as input, and outputs a set $C = \{c_1, c_2, \dots, c_K\}$ of $K$ centroids. The algorithm starts by randomly initialising all centroids; it then iterates between two steps until a stopping criterion is met (often, the maximum number of iterations is reached). In the first step, each sample $s_j$ is assigned to its nearest centroid $c_k$, based on the distance measure between $s_j$ and $c_k$, generating a set $S_k$ formed by the samples assigned to the $k$-th cluster centroid. In the second step, each centroid $c_k$ is updated to the mean of all samples assigned to its set $S_k$ as follows:

$$c_k = \frac{1}{|S_k|} \sum_{s_j \in S_k} s_j \qquad (8)$$
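The two alternating steps can be sketched as below. To keep the example reproducible, the initial centroids are passed explicitly instead of being drawn at random, and an empty cluster simply keeps its previous centroid; both choices are assumptions of this sketch, as are the function names and the toy samples.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(samples, initial_centroids, max_iter=100):
    """Alternate the assignment and update steps until the centroids stop changing."""
    centroids = [list(map(float, c)) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 1: assign every sample to its nearest centroid.
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(range(len(centroids)), key=lambda k: euclidean(s, centroids[k]))
            clusters[nearest].append(s)
        # Step 2: move every centroid to the mean of its assigned samples (equation 8).
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Toy samples forming two well-separated groups (invented values).
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters = kmeans(samples, [[0.0, 0.0], [10.0, 10.0]])
```

On this toy input the procedure converges after two passes, with one centroid at the mean of each group.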


Self-organizing maps (SOM): SOM is commonly used for visualizing and clustering multidimensional data, owing to its ability to project high-dimensional data onto a lower dimension [33]-[37]. The SOM usually consists of a regular grid of map units. Each unit $i$ is represented by a weight vector $W_i = (w_{i1}, w_{i2}, \dots, w_{im})$, where $m$ is the dimension of the input samples. The units are connected to adjacent ones by a neighbourhood relation. The SOM is trained iteratively. At each training step, an input sample $S$ is randomly chosen from the input data set, and a distance is computed to all weight vectors $W_i$ to find the reference vector $W_{bmu}$, called the Best Matching Unit (BMU), that satisfies a minimum distance (or maximum similarity) criterion following equation 9.

$$bmu(t) = \arg\min_{1 \le i \le n} \| S(t) - W_i(t) \| \qquad (9)$$

where $n$ is the number of neurons in the map. The weights of the BMU and its neighbours are then adjusted towards the input pattern, following the equation:

$$W_i(t+1) = W_i(t) + \beta_{bmu,i}(t) \, \big( S(t) - W_i(t) \big) \qquad (10)$$

where $\beta_{bmu,i}$ is the neighbourhood function between the winner neuron $bmu$ and the neighbour neuron $i$. It is defined by equation 11.

$$\beta_{bmu,i}(t) = \exp\left( -\frac{\| r_{bmu} - r_i \|^2}{2\sigma^2(t)} \right) \qquad (11)$$

where $r_{bmu}$ and $r_i$ are the positions of the BMU and of neuron $i$ on the Kohonen topological map, and $\sigma(t)$ decreases monotonically with time.
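A single training step can be sketched as follows, assuming that the update moves each weight vector towards the sample along the difference between the sample and the weight vector. The `learning_rate` argument is an assumption of this sketch (Table 2 lists a learning rate, but it does not appear in the update equation); with its default value of 1.0 the step depends only on the neighbourhood function. All names and values are illustrative.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def som_step(weights, positions, sample, sigma, learning_rate=1.0):
    """One SOM update: find the BMU, then pull every unit towards the sample."""
    # Equation 9: the BMU is the unit whose weight vector is closest to the sample.
    bmu = min(range(len(weights)), key=lambda i: squared_distance(weights[i], sample))
    updated = []
    for w, r in zip(weights, positions):
        # Equation 11: Gaussian neighbourhood computed from positions on the map grid.
        beta = math.exp(-squared_distance(positions[bmu], r) / (2 * sigma ** 2))
        # Equation 10 in vector form, scaled by the assumed learning rate.
        updated.append([wi + learning_rate * beta * (si - wi) for wi, si in zip(w, sample)])
    return bmu, updated


# Two units on a 1x2 grid and one input sample (invented values).
weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
bmu, new_weights = som_step(weights, positions, [2.0, 2.0], sigma=1.0)
```

In this example the second unit is the BMU and moves fully onto the sample, while its grid neighbour moves only part of the way, weighted by the neighbourhood function.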

K-nearest neighbours (k-NN): a non-parametric method used for classification [38]-[40]. The process begins by calculating the distance $D_{euc}(s_i, s_j)$ between the test sample $s_i$ and each training sample $s_j$, and sorting the distances in ascending order. It then selects the $k$ closest neighbours of the sample $s_i$. To predict the class of this sample, it uses majority voting: the class that occurs most frequently among the nearest neighbours wins.
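The procedure above can be sketched in a few lines, assuming the Euclidean distance and a simple majority vote. The function name, the training samples and their labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_samples, train_labels, test_sample, k=5):
    """Predict the class of test_sample by majority vote among its k nearest neighbours."""
    if not 1 <= k <= len(train_samples):
        raise ValueError("k must be between 1 and the number of training samples")

    def distance(i):
        # Euclidean distance between training sample i and the test sample.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(train_samples[i], test_sample)))

    # Sort training indices by increasing distance and keep the k closest.
    nearest = sorted(range(len(train_samples)), key=distance)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    # On a tie, the class of the closest tied neighbour is returned.
    return votes.most_common(1)[0][0]


# Invented two-dimensional training set.
train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["Normal", "Normal", "Normal", "Abnormal", "Abnormal"]
prediction = knn_predict(train, labels, [5.5, 5.0], k=3)
```

With `k=3`, two of the three nearest neighbours of the test point are labelled "Abnormal", so that class wins the vote.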

Random forest (RF): can be thought of as a form of nearest neighbour predictor. It builds a set of decision trees from randomly selected subsets of the original training set, and aggregates the votes of the different decision trees to decide the final class of the test object. It is considered well suited to situations characterized by a large number of features [41]-[43].


2.4. Datasets Description

In this study, the following published datasets were used (a brief description is given in Table 3). The first is the ALL/AML leukemia dataset proposed by Golub et al. in 1999 [3]; it contains 7129 genes and 72 samples split into two classes. It was used to classify patients with acute myeloid leukemia (labelled AML; 25 examples, 34.7%) and acute lymphoblastic leukemia (labelled ALL; 47 examples, 65.3%). The second is the Colon cancer dataset [44], which contains 62 samples: 40 tumor biopsies (labelled "N") and 22 normal biopsies (labelled "P") taken from healthy parts of the colons of the same patients. The total number of genes to be tested is 2000. The third is the Lymphoma cancer dataset [45]; it includes 45 tissues and 4026 genes. The first category, Germinal Centre B-Like (labelled GCL), has 23 patients, and the second, Activated B-Like (labelled ACL), has 22. The problem is to distinguish the GCL samples from the ACL samples. This dataset contains about 3.28% missing values.

Before applying any learning algorithm, the data must be pre-processed through several steps, such as missing value imputation, noisy data elimination, and normalization.

Missing values: in general, datasets contain missing values that occur for a variety of reasons, including hybridization failures, artifacts on the microarray, insufficient resolution, image noise and corruption, or systematically as a result of the spotting process. There are many techniques to handle these missing values, such as omitting the entire record that contains the missing value, or imputing it by the median, the mean or K-NN [46].

Data normalization: some algorithms, such as K-means and K-NN, may require the data to be normalized to increase both the efficacy and the efficiency of the algorithm. Normalization prevents distance measures from being dominated by attributes with larger ranges, by placing all attributes within a similar range, usually [0, 1] [47].

Data discretization: discretization is the process of converting continuous variables into nominal ones. Studies have shown that discretization makes learning algorithms more accurate and faster [48]. The process can be done manually or by predefining thresholds on which to divide the data [49]-[51].

In this study, the percentage of missing values in our datasets is less than 5%, which leads us to impute the missing values by the mean; and all data were normalized to zero. The gene expression values were then used directly as input features for the classifiers. The framework of our process is described in Figure 1.
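Two of the pre-processing steps can be sketched as below: mean imputation of missing values and scaling of a profile to the range [0, 1]. The sketch assumes that missing entries are encoded as `None`, that both operations are applied per gene profile, and that min-max scaling is used; these are illustrative assumptions, not necessarily the exact procedure followed here, and the function names are hypothetical.

```python
def impute_mean(profile):
    """Replace missing entries (None) with the mean of the observed entries."""
    known = [v for v in profile if v is not None]
    if not known:
        raise ValueError("cannot impute a profile with no observed value")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in profile]


def min_max(profile):
    """Rescale a profile linearly to the range [0, 1]."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        # A constant profile carries no information; map it to zeros.
        return [0.0 for _ in profile]
    return [(v - lo) / (hi - lo) for v in profile]


# Invented profile with one missing measurement.
raw = [2.0, None, 6.0]
imputed = impute_mean(raw)
scaled = min_max(imputed)
```

The missing entry is replaced by the mean of the two observed values, after which the profile is mapped onto [0, 1].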


[Figure 1: flow diagram. Feature selection (RFS, IG) followed by the classification algorithm (SOM, K-means, Random forest, KNN).]

Figure 1. Framework used in this research




3. RESULTS AND ANALYSIS 

In this study, five feature selectors were tested on four different classifiers using three gene expression datasets, labelled Leukemia, Colon and Lymphoma (a short description is given in Table 3). Classification accuracies before and after feature selection are presented in Table 4. The columns named ALL, RFS, IG, Chi-2, Corr and mRMR give the classification accuracy obtained using all features, and the Random Forest Selector, Information Gain, Chi-square, linear Correlation and Minimum Redundancy Maximum Relevance filters, respectively.

Table 3. A Brief Summary of Datasets Used

Dataset     No. of examples   No. of features   Class 1    Class 2
Leukemia    72                7129              47 (ALL)   25 (AML)
Colon       62                2000              40 (N)     22 (P)
Lymphoma    45                4026              22 (ACL)   23 (GCL)


Table 4. Effects of Feature Selection on Classifiers Using 100 Important Features (Classification Accuracy %)

Classifier      Dataset     ALL     RFS     IG      Chi-2   Corr    mRMR
K-NN            Leukemia    89.28   96.43   92.86   95.24   94.29   100
                Colon       78.01   87.22   86.82   87.77   88.88   85.63
                Lymphoma    93.33   100     100     100     100     100
K-means         Leukemia    84.72   98.61   98.61   97.22   97.22   98.61
                Colon       79.03   87.09   88.70   88.70   88.70   90.32
                Lymphoma    82.22   93.33   97.7    100     100     100
SOM             Leukemia    93.05   91.66   94.44   87.5    95.83   94.44
                Colon       88.70   98.38   96.77   93.54   96.77   95.16
                Lymphoma    87.68   88.88   95.55   93.33   95.55   97.77
Random Forest   Leukemia    97.11   98.55   97.76   98.63   97.13   97.13
                Colon       83.39   88.40   86.50   88.73   86.82   85.06
                Lymphoma    90.74   98.33   95.16   93.05   93.16   100


The k-Nearest Neighbour (k-NN), Self-organizing maps (SOM), K-means and Random Forest algorithms were used as classifiers in the experiments, and the accuracies obtained with the five filters (Random Forest Selector, Information Gain, chi-square, linear correlation and Minimum Redundancy Maximum Relevance) when the top 100 features are selected were compared.

Filters were chosen because of the enormous size of the datasets used, which increases the computation time. For the k-NN classifier, we used the Euclidean distance as the distance metric and the best k between 2 and 9; the same holds for K-means. For SOM, we used the following parameters: 10x10 input neurons, a learning rate of 0.9, the Euclidean distance metric, random initialization of all neurons, and 1000 iterations. For Random Forest, we used 500 trees and 10 variables tried at each split. A summary is given in Table 2.

The results in Table 4 and in Figures 2, 3, 4 and 5 show that variable selection (the top 100 features in this experiment) has a very important effect on the classification rate. From the table we can observe that mRMR and RFS are slightly better than the other methods on the Leukemia dataset when used with K-NN and K-means, with a great improvement over the use of all variables. For the Lymphoma dataset, all the selectors work very well with all classifiers, with the exception of SOM, which is best suited to mRMR and RFS, and there is still an improvement over the use of all features. For the Colon dataset, the classification rate is lower in all cases, with an improvement when using SOM as the classifier and RFS as the filter.


[Figure 2: line plot with one series per dataset (Leukemia, Colon, Lymphoma) over the x-axis categories ALL, RFS, IG, Chi2, Corr and mRMR.]

Figure 2. Effects of feature selection on KNN using 100 important features

Figure 4. Effects of feature selection on SOM using 100 important features


CONCLUSION 
Feature selection is an important issue in classification, because it may have a considerable effect on the accuracy of the classifier. It reduces the number of dimensions of the dataset, so processor and memory usage decrease, and the data become more comprehensible and easier to study. In this study we investigated the influence of feature selection on four classifiers (SOM, K-NN, K-means and Random Forest) using three datasets. Using only the top 100 features, the classification accuracy improved by up to 9% compared with using all features, and the complexity and training time were reduced.